Information Extraction from the Web

Authors

  • Wolfgang May
  • Georg Lausen
  • Georges Koehler

Abstract

The goal of information extraction from the Web is to provide an integrated view on data from autonomous, heterogeneous information sources. The main problem with current wrapper/mediator approaches is that they rely on very different formalisms and tools for wrappers and mediators, thus leading to an impedance mismatch between the wrapper and mediator level. Additionally, most approaches nowadays are restricted to accessing information only from a fixed set of sources. On the other hand, generic Web querying approaches are restricted to purely syntactical and structural queries and do not deal with semantical issues. In this paper, we discuss an integrated architecture for Web exploration, wrapping, mediation, and querying. Our system is based on a unified framework (i.e., data model and language) in which all tasks are performed. We regard the Web and its contents as a unit, represented in an object-oriented data model: the Web structure given by its hyperlinks, the parse trees of Web pages, and its contents are all included in the internal world model of the system. The advantage of this unified view is that the same data manipulation and querying language can be used for the Web structure and the application-level model. The model is complemented by a rule-based object-oriented language which is extended by Web access capabilities and structured document analysis. Thus, accessing Web pages, wrapping, mediating, and querying information can be done using the same language. This integration also allows for data-driven Web exploration which is independent of a given network of individual predefined wrappers and mediators. Thus, in addition to the classical wrapper and mediator functionality, a system with this architecture can be equipped with Web navigation and exploration functionality. Queries to existing Web indexing and searching engines can also be integrated. In particular, we present a methodology for reusing generic rule patterns for typical extraction, integration, and restructuring tasks using this framework. In an abstract sense, the system contains a universal wrapper which can be applied to arbitrary Web pages that the system learns about during information processing. Equipped with suitably intelligent rules, the system can potentially explore initially unknown parts of the Web, thus coping with the steady growth of the Web. We show the practicability of our approach by using the Florid system [HKL]. The approach is illustrated by two case studies.
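
The unified view described in the abstract can be made concrete with a small sketch. It is written in Python as a stand-in and is not Florid or F-logic syntax. Every name in it (`FAKE_WEB`, `FactCollector`, `query`, `explore`, `derive_capitals`) is hypothetical, and the in-memory `FAKE_WEB` dictionary replaces real Web access. The sketch assumes well-formed HTML without void elements. It stores hyperlinks, parse-tree nodes, and application-level facts in one set of triples, so that a single `query` function serves all three levels and exploration is driven by the link facts found so far.

```python
from html.parser import HTMLParser

# Hypothetical in-memory stand-in for the Web: URL -> HTML text.
FAKE_WEB = {
    "http://example.org/index":
        '<html><body><h1>Countries</h1>'
        '<a href="http://example.org/fr">France</a></body></html>',
    "http://example.org/fr":
        '<html><body><h1>France</h1><p class="capital">Paris</p></body></html>',
}


class FactCollector(HTMLParser):
    """Records one page's parse tree and hyperlinks as (subject, attribute, value) facts."""

    def __init__(self, url, facts):
        super().__init__()
        self.url, self.facts = url, facts
        self.stack = [url]  # the page itself is the root of its parse tree
        self.count = 0

    def handle_starttag(self, tag, attrs):
        self.count += 1
        node = f"{self.url}#n{self.count}"
        self.facts.add((node, "tag", tag))
        self.facts.add((node, "parent", self.stack[-1]))
        for name, value in attrs:
            self.facts.add((node, "attr:" + name, value))
            if tag == "a" and name == "href":
                # Web structure lives in the same fact base as the parse tree.
                self.facts.add((self.url, "links_to", value))
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Simplification: assumes every start tag has a matching end tag.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and len(self.stack) > 1:
            self.facts.add((self.stack[-1], "text", text))


def query(facts, subject=None, attribute=None, value=None):
    """One query function for Web structure, parse trees, and application data."""
    return sorted(
        f for f in facts
        if (subject is None or f[0] == subject)
        and (attribute is None or f[1] == attribute)
        and (value is None or f[2] == value)
    )


def explore(start_url, fetch):
    """Data-driven exploration: link facts derived so far decide what to visit next."""
    facts, pending, seen = set(), [start_url], set()
    while pending:
        url = pending.pop()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue
        facts.add((url, "type", "page"))
        FactCollector(url, facts).feed(html)
        pending.extend(v for _, _, v in query(facts, subject=url, attribute="links_to"))
    return facts


def derive_capitals(facts):
    """A wrapper-style rule: lift parse-tree facts to an application-level fact."""
    derived = set()
    for node, _, _ in query(facts, attribute="attr:class", value="capital"):
        page = node.split("#")[0]
        for _, _, text in query(facts, subject=node, attribute="text"):
            derived.add((page, "capital", text))
    return derived


facts = explore("http://example.org/index", FAKE_WEB.get)
facts |= derive_capitals(facts)
print(query(facts, attribute="links_to"))  # Web-structure query
print(query(facts, attribute="capital"))   # application-level query
```

In this sketch the page at `http://example.org/fr` is never listed in advance. It is reached only because a link fact pointing to it was extracted from the start page, which loosely mirrors the idea of exploring parts of the Web that were initially unknown to the system.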

Related resources

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

EXTRACTION-BASED TEXT SUMMARIZATION USING FUZZY ANALYSIS

Due to the explosive growth of the world-wide web, automatic text summarization has become an essential tool for web users. In this paper we present a novel approach for creating text summaries. Using fuzzy logic and WordNet, our model extracts the most relevant sentences from an original document. The approach utilizes fuzzy measures and inference on the extracted textual information from the docu...

A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model

Information extraction (IE) is the process of automatically producing a structured representation from unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP), one that has been intensified by the growing volume, heterogeneity, and unstructured form of available information. One of the core information extraction tasks is relation extraction wh...

Assessing the Internal Structure of the Ellis Information Retrieval Model in Order to Present the Persian Norm of Web Retrieval Tools

Introduction: This study evaluated the internal structure of the Ellis information-seeking model in the student community with the aim of presenting the Persian norm. Methods: This is a descriptive-analytical study conducted using a cross-sectional survey in the second semester of the academic year 1399-1400. The population comprised 280 graduate students at Ahvaz Jundishapur University of Medical Scien...

Adaptive Information Analysis in Higher Education Institutes

Information integration plays an important role in academic environments since it provides a comprehensive view of education data and enables managers to analyze and evaluate the effectiveness of education processes. However, the problem with traditional information integration is the lack of personalization due to weak information resources or the unavailability of analysis functionality. In this ...

Publication date: 2000